Disproportionate effects of COVID-19 on majority African American communities in the U.S.

Author: Rachel Smith

COVID-19 has been dominating our thoughts, our lives, and the news for months now. As this deadly pandemic ravages the world, news outlets have been reporting that racial disparities have deadly implications for African Americans. Reports suggest an overrepresentation of infections, hospitalizations, and deaths among African Americans compared to their white counterparts. This is unsurprising for countless reasons, but I wanted to dig into the data for myself. There are many different ways to approach this analysis, but for simplicity's sake, I use data reporting COVID-related deaths by county and match that to 2010 US census data reporting racial demographics by county. Here, I show data demonstrating that majority black communities are being disproportionately affected by COVID-19.

Install and import packages

In [1]:
#pip install chart_studio
In [2]:
#pip install "notebook>=5.3" 
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
import seaborn as sns
import chart_studio
import chart_studio.plotly as py
import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio

Task 1: Explore COVID-19 deaths data. How have COVID deaths been progressing state-wise? How are these deaths distributed across counties?

Load data

In [4]:
url = "https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_deaths_usafacts.csv"
df_us_deaths = pd.read_csv(url)

Link to data: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/

The data I will be using to show COVID-19 death rates comes from USAFacts.org. USAFacts lists cumulative deaths in each county in each state of the US starting 1/22/20, and includes state and county FIPS. These codes will come in handy later for merging dataframes. USAFacts also has separate data sets for confirmed cases and population adjustments. I will be using confirmed deaths as a metric for severity, and conducting my own county-based population adjustments.
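
Since the FIPS codes are what the later merge hinges on, here is a small sketch of how a county FIPS code is put together: the state code followed by a zero-padded, three-digit county code. The helper names `make_county_fips` and `split_county_fips` are made up for illustration and are not used anywhere else in this analysis.

```python
# Illustrative helpers (hypothetical names, not used elsewhere in this notebook).
# A county FIPS code is the state FIPS code followed by a zero-padded,
# three-digit county code, e.g. state 1 + county 1 -> 1001 (Autauga County, AL).

def make_county_fips(state_fips: int, county_code: int) -> int:
    """Combine a state code and a county code into one integer FIPS code."""
    return state_fips * 1000 + county_code


def split_county_fips(county_fips: int) -> tuple:
    """Split an integer county FIPS code back into (state, county) parts."""
    return divmod(county_fips, 1000)


autauga = make_county_fips(1, 1)
print(autauga)                     # 1001
print(split_county_fips(autauga))  # (1, 1)
print(f"{autauga:05d}")            # 01001, the zero-padded string form
```

Storing the code as an integer drops the leading zero, which is why the USAFacts table shows 1001 rather than 01001.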

Explore & format data

In [5]:
df_us_deaths.head() #check out the data!
Out[5]:
countyFIPS County Name State stateFIPS 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 4/20/20 4/21/20 4/22/20 4/23/20 4/24/20 4/25/20 4/26/20 4/27/20 4/28/20 4/29/20
0 0 Statewide Unallocated AL 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1001 Autauga County AL 1 0 0 0 0 0 0 ... 1 1 1 2 2 2 2 3 4 4
2 1003 Baldwin County AL 1 0 0 0 0 0 0 ... 1 2 2 2 2 2 2 2 2 2
3 1005 Barbour County AL 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 1007 Bibb County AL 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 103 columns

In [6]:
df_us_deaths.shape # 3195 rows: one per county, plus the "Statewide Unallocated" rows
Out[6]:
(3195, 103)
In [7]:
df_us_deaths.isnull().values.any() # yayyy we have all our data!!
Out[7]:
False
In [8]:
#Format column names


#probably going to want dates in datetime format

df_labels = df_us_deaths.iloc[:,:4]

df_dates = df_us_deaths.iloc[:, 4:]

names_old = df_dates.columns.tolist()
    
names_new = []
    
for i in names_old :
    dtobject = dt.strptime(i, "%m/%d/%y").strftime("%m-%d-%Y")
    names_new += [dtobject]
        
df_dates.columns = names_new

df_us_deaths = pd.concat([df_labels, df_dates], axis = 1)

#also don't want that space in the "County Name" column

df_us_deaths = df_us_deaths.rename(columns = {"County Name":"County"})

First, let's get an overview of how deaths in each state have been progressing over time

In [9]:
#get rid of county info (for now)

df_state_deaths = df_us_deaths.iloc[:, 2:]


#group state rows together

df_state_deaths = df_state_deaths.drop(columns = ["stateFIPS"])
df_deaths_by_state = df_state_deaths.groupby(["State"], as_index=False).agg("sum")


#reshape data frame

df_deaths_by_state = pd.melt(df_deaths_by_state, id_vars = "State").rename(columns = {"variable": "Date", "value": "Deaths"})


#rows with all zeros (no deaths) aren't very informative for us...

df_deaths_by_state = df_deaths_by_state.loc[df_deaths_by_state["Deaths"] != 0, :]


# plot!!

plot = px.line(df_deaths_by_state, 
               x='Date', 
               y='Deaths', 
               color='State',
              title = "Deaths by state over time",
              width = 1000,
              height = 700)

plotly.offline.iplot(plot)

The plots I use here are interactive through plotly. Double clicking on an item in the legend or on a line in the chart will isolate it so you can view only that data. Then, single clicking on additional states will add them to the chart for comparison. Hovering over a specific data point will give you the cumulative deaths up to that specific date in that specific state.

Now, let's get a general idea of the distribution of deaths throughout counties

In [10]:
# Extract the total number of deaths to date per county

df_us_deaths['County,State'] = df_us_deaths[['County', 'State']].agg(', '.join, axis=1)

df_county_deaths = df_us_deaths.drop(columns = ["State", "stateFIPS", "County"])

df_county_totals = df_county_deaths.iloc[:, [0,-1,-2]]
df_county_totals = df_county_totals.rename(columns = {df_county_totals.columns[-1]:"Deaths"})


#also, there are a lot of zeros so I'll take these out for visualization purposes...

df_county_totals_deaths = df_county_totals.loc[df_county_totals["Deaths"] != 0, :]


# plot!!


hist = px.histogram(df_county_totals_deaths, 
                    x="Deaths",
                   nbins = 100,
                   log_y = True,
                   title = "Distribution of deaths by county", 
                   marginal = "violin",
                    hover_data = ["County,State", "Deaths"],
                    width = 1000,
                    height = 600
                   )

hist.update_layout(yaxis_title_text = 'Number of counties')

plotly.offline.iplot(hist)

This histogram shows the distribution of counties based on COVID-related deaths. This is interesting as a snapshot, but what we really want to look at with this analysis is the racial demographics of those harder-hit counties. Time to bring in the census data...

Task 2: Explore census data. What racial identities should we include in this analysis based on prevalence?

Load Data

In [11]:
#Get census data


al_mo_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_al_mo.csv"
df_al_mo = pd.read_csv(al_mo_url)

mt_wy_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_mt_wy.csv"
df_mt_wy = pd.read_csv(mt_wy_url, encoding = 'latin-1')

df_census = pd.concat([df_al_mo, df_mt_wy], ignore_index=True)

The data I will be using to determine racial demographics comes from the United States Census Bureau. Of note, this data comes from the last nationwide census in 2010. Subsequent annual surveys have only included areas above a certain population threshold (65,000 people), which may exclude areas of interest from this analysis. Thus, until 2020 census data is publicly available, 2010 will have to do.

This data set includes information about sex, Hispanic origin, age group, and race for each county by FIPS. I am interested in looking at racial demographics, but a lot of other cool analyses could be performed with this information.

Format data

In [12]:
#for this project, I'm interested in race by region

df_census = df_census.drop(columns = ["SUMLEV", "SEX", "AGEGRP"])


#Gotta fix these column names too

dict_names = {"STATE":"stateFIPS",
              "COUNTY":"countyFIPS",
              "STNAME":"State",
             "CTYNAME":"County",
             "ORIGIN":"Hispanic",
             "IMPRACE":"Race",
             "RESPOP":"Num_res"}

df_census = df_census.rename(columns = dict_names)
In [13]:
#The census countyFIPS are in a different format than the USAfacts countyFIPS :(
#USAFacts uses stateFIPS followed by a zero-padded 3-digit county code (e.g. 1001),
#so build the same integer from the two census columns


df_census["stateFIPS"] = df_census["stateFIPS"].astype(int)
df_census["countyFIPS"] = df_census["stateFIPS"] * 1000 + df_census["countyFIPS"].astype(int)

#df_census.head(10)
In [14]:
df_census["Num_res"].sum() #about 300 million people in the US as of 2010... math checks out
Out[14]:
308745538

Determine what racial groups to include in analysis

There are 31 different race categories in the US census data, most of which are mixed. It would be difficult to categorize and find meaningful data based on all of these categories, so I'm going to determine which categories comprise the majority of the population and run the analysis based on these categories.

Also, census information separates Hispanic origin from race. I'm going to add all residents who identify as having Hispanic origin to a separate race, and not include these people in the racial group they had initially chosen (i.e. "Hispanic white" --> "Hispanic", "non-Hispanic white" --> "white").
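
To make the recoding concrete, here is a toy sketch on a made-up four-row table (the resident counts are invented for illustration). It assumes, as the cell below does, that a value of 2 in the origin column marks Hispanic residents.

```python
import pandas as pd

# Toy data: the counts are made up for illustration only.
# Hispanic: 1 = not Hispanic, 2 = Hispanic (assumed coding, matching the cell below)
# Race: 1 = White alone, 2 = Black or African American alone
toy = pd.DataFrame({
    "Hispanic": [1, 2, 1, 2],
    "Race":     [1, 1, 2, 2],
    "Num_res":  [100, 20, 50, 5],
})

# Move every Hispanic row into its own category, Race 0
toy.loc[toy["Hispanic"] == 2, "Race"] = 0

# Residents per category after recoding (converted to plain ints for printing)
grouped = toy.groupby("Race")["Num_res"].sum()
totals = {int(race): int(n) for race, n in grouped.items()}
print(totals)  # {0: 25, 1: 100, 2: 50}
```

Each resident ends up in exactly one category, so the total across categories is unchanged by the recoding.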

In [15]:
#Create separate race category for everyone who identifies as Hispanic (Race "0")


#if the person identifies as Hispanic, add them to Race 0

df_census.loc[df_census.Hispanic == 2, "Race"] = 0
In [16]:
#Look at histogram of how prevalent these races are to determine what to include in analysis

df_pop_by_race = df_census.groupby(["Race"], as_index=False).sum().drop(columns = ["stateFIPS", "countyFIPS", "Hispanic"]).sort_values(by = "Num_res", ascending = False)


pop = px.bar(df_pop_by_race, 
             x='Race', 
             y='Num_res',
             title = "Racial makeup of the US",
            width = 1000,
            height = 600)

pop.update_layout(yaxis_title_text = 'Number of residents', 
                  xaxis_type = 'category')

plotly.offline.iplot(pop)

Based on the above plot, races 1, 0, 2, 4, 3, 6, 8, & 7 make up the overwhelming majority of the US population, so this analysis focuses on those categories, where:

  0 = Hispanic
  1 = White alone
  2 = Black or African American alone
  3 = American Indian and Alaskan Native alone
  4 = Asian alone
  6 = White and Black or African American
  7 = White and American Indian and Alaskan Native
  8 = White and Asian
In [17]:
races = [1, 0, 2, 4, 3, 6, 8, 7]

df_census = df_census[df_census["Race"].isin(races)].drop(columns = ["Hispanic"])

To further condense our list, I'm combining the biracial categories with the corresponding non-white race. By all accounts, these people still experience racism. As such, our list will consist of just:

  0 = Hispanic
  1 = White
  2 = Black or African American (incl. 6)
  3 = American Indian and Alaskan Native (incl. 7)
  4 = Asian (incl. 8)
In [18]:
df_census.loc[df_census.Race == 6, "Race"] = 2
df_census.loc[df_census.Race == 7, "Race"] = 3
df_census.loc[df_census.Race == 8, "Race"] = 4


#Replace race number indicator with actual race

df_census["Race"] = df_census["Race"].replace({0: "Hispanic",
                          1: "White",
                          2: "Black",
                          3: "American Indian",
                          4: "Asian"})
In [19]:
#Look at US demographics based on these major groups


df_pop_by_race1 = df_census.groupby(["Race"], as_index=False).sum().drop(columns = ["stateFIPS", "countyFIPS"]).sort_values(by = "Num_res", ascending = False)

df_pop_by_race1["Percent of total population"] = ((df_pop_by_race1["Num_res"]/df_pop_by_race1["Num_res"].sum())*100).round(2)


pop = px.bar(df_pop_by_race1, 
             x='Race', 
             y='Percent of total population',
             title = "US Demographics",
             width = 1000,
             height = 600
            )

pop.update_layout(yaxis_title_text = 'Percent of total population')

plotly.offline.iplot(pop)

Here you have the US racial demographics as of 2010 based on the five most prevalent race categories. These are the races that will be included in our analysis.

Task 3: Look at census data and COVID-19 data together. Do counties that have a majority population that identifies as non-white have a disproportionately high death rate?

Merge data frames

In [20]:
# sum residents of each race by county

df_census_by_region = df_census.groupby(["stateFIPS", "countyFIPS", "Race"], as_index = False).agg({"Num_res":"sum"})


# Merge residents by race of each region with COVID deaths of each region

df_region_race_deaths = pd.merge(df_census_by_region, df_county_totals, on = ["countyFIPS"])

#Add percent race by county as a column

dfx = df_region_race_deaths.groupby(["countyFIPS"], as_index = False).agg({"Num_res":"sum"}).rename(columns = {"Num_res":"total_res"})
df_percents_by_county = pd.merge(df_region_race_deaths, dfx, on = "countyFIPS")

df_percents_by_county.loc[:, "percent_race"] = ((df_percents_by_county["Num_res"]/df_percents_by_county["total_res"])*100).round(2)


#Add percent death by county as a column

df_percents_by_county.loc[:, "percent_death"] = ((df_percents_by_county["Deaths"]/df_percents_by_county["total_res"])*100).round(5)

df_percents_by_county = df_percents_by_county[["County,State",'total_res','Deaths',"percent_death", 'Race','Num_res', 'percent_race']]

pd.set_option('display.max_rows', None)
df_percents_by_county
Out[20]:
County,State total_res Deaths percent_death Race Num_res percent_race
0 Autauga County, AL 54434 4 0.00735 American Indian 468 0.86
1 Autauga County, AL 54434 4 0.00735 Asian 649 1.19
2 Autauga County, AL 54434 4 0.00735 Black 9813 18.03
3 Autauga County, AL 54434 4 0.00735 Hispanic 1310 2.41
4 Autauga County, AL 54434 4 0.00735 White 42194 77.51
... ... ... ... ... ... ... ...
15666 Weston County, WY 7203 0 0.00000 American Indian 155 2.15
15667 Weston County, WY 7203 0 0.00000 Asian 30 0.42
15668 Weston County, WY 7203 0 0.00000 Black 36 0.50
15669 Weston County, WY 7203 0 0.00000 Hispanic 216 3.00
15670 Weston County, WY 7203 0 0.00000 White 6766 93.93

15671 rows × 7 columns

Now we have a data frame that gives us information regarding the total number of residents by race and the total number of deaths due to COVID-19, as well as the percentages of each normalized to respective county population.
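
As a quick sanity check on the two percentage columns, the Autauga County rows above can be reproduced by hand. This sketch only re-does the arithmetic with the numbers printed in the table; the variable names are mine.

```python
# Numbers copied from the Autauga County, AL rows in the table above
total_res = 54434   # residents summed across the five race groups
black_res = 9813    # residents in the "Black" group
deaths = 4          # cumulative COVID-19 deaths as of 4/29

percent_race = round(black_res / total_res * 100, 2)
percent_death = round(deaths / total_res * 100, 5)

print(percent_race)   # 18.03
print(percent_death)  # 0.00735
```

Note that `percent_death` is deaths per 100 residents, so 0.1 in this column corresponds to one death per 1,000 residents.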

In [21]:
#Just interested in the demographics of my home county...


df_lebanon = df_percents_by_county.loc[df_percents_by_county["County,State"]=="Lebanon County, PA", :]
#df_lebanon
In [22]:
#And of DC....


df_dc = df_percents_by_county.loc[df_percents_by_county["County,State"]=="Washington, DC", :]
#df_dc

How are counties with majority non-white (specifically, African American or Black) population disproportionately affected by this pandemic?

1. Do majority non-white communities have disproportionately high death rates due to COVID-19, considering their prevalence in the US?

In [29]:
# counties that have a majority race, in order of decreasing percent death

df_majority_counties = df_percents_by_county.loc[df_percents_by_county.groupby("County,State")["percent_race"].idxmax()].drop(columns = ["total_res", "Deaths", "Num_res"]).sort_values("percent_death", ascending = False)
df_majority_counties = df_majority_counties.loc[df_majority_counties["percent_race"] > 50, :]


# pull counties with highest percent death

df_top_counties = df_majority_counties.head(10)


#pull top counties by race

df_tc_wh = df_top_counties.loc[df_top_counties["Race"] == "White",:]
df_tc_co = df_top_counties.loc[df_top_counties["Race"] != "White",:]


#calculate majority counties (again) and divide by race

df_maj_wh = df_majority_counties.loc[df_majority_counties["Race"] == "White",:]
df_maj_co = df_majority_counties.loc[df_majority_counties["Race"] != "White",:]


#calculate percentage of total counties

percent_white = (df_tc_wh.shape[0]/df_maj_wh.shape[0])*100
percent_poc = (df_tc_co.shape[0]/df_maj_co.shape[0])*100


#plot 

df_tc = pd.DataFrame({"Race": ["Majority white", "Majority POC"],
                                "% counties in top 10 affected by COVID":[percent_white, percent_poc]})

bar4 = px.bar(df_tc, 
              x = "Race",
              y = "% counties in top 10 affected by COVID",
              color = "Race",
              color_discrete_sequence=px.colors.sequential.Rainbow,
             title = "Percent of counties within race in the top 10 highest death rates",
             hover_data = ["% counties in top 10 affected by COVID"],
             width = 600,
             height = 600)

bar4.update_yaxes(tickprefix = "%")

plotly.offline.iplot(bar4)

As of 4/29, 2.82% of US counties with a majority non-white population are among the 10 counties with the highest COVID-19 death rates (deaths as a percentage of county population). Only 0.14% of majority white counties are on this list.

2. Narrowing the focus on the portion of the population that identifies as Black or African American or non-Hispanic white, how does the racial makeup of these counties affect their death rates?

In [24]:
# find the percent race based on white and black for each county

race = df_percents_by_county["Race"]
df_race = df_percents_by_county.loc[(race == "White") | (race == "Black"), :]


#plot

px.defaults.width = 800
px.defaults.height = 800

scat = px.scatter(df_race,
                    x = "percent_race", 
                    y = "percent_death",
                    size = "percent_death",
                  hover_data = ["County,State", "percent_race", "percent_death"],
                   facet_row = "Race",
                  color = "percent_death",
                  color_continuous_scale=px.colors.sequential.Burgyl
                 )

scat.update_yaxes(tickprefix = "%")

scat.update_layout(xaxis_title_text = 'Percent county pop that identifies as respective race',
                  title = "Death rate by racial makeup of county"
                    )

plotly.offline.iplot(scat)

The above plot demonstrates the relationship between the percentage of the county population that is either white or black, and the percentage of the county population that died due to COVID-19. Big bubbles in the top right quadrant of either plot represent counties with a high percentage of that particular race as well as a high number of deaths relative to county population size. The plot representing people who identify as Black or African-American has several of these markers, indicating higher death rates in majority Black counties, while the plot representing people who identify as non-Hispanic white does not.

Next, we focus in on counties where the majority of the population identifies as either Black/African American or white. We can determine the majority race of a county by defining it as greater than 50% for that county. The rest of this analysis will focus on "majority black" and "majority white" counties in this way.
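
The greater-than-50% rule can be sketched on a toy table as follows. The county names and percentages are invented, and `majority_rows` is a hypothetical helper that takes the data frame as an argument instead of reading a global variable.

```python
import pandas as pd

# Toy data with invented counties and percentages, for illustration only
toy = pd.DataFrame({
    "County,State": ["A County, XX"] * 2 + ["B County, XX"] * 2 + ["C County, XX"] * 2,
    "Race":         ["White", "Black"] * 3,
    "percent_race": [70.0, 30.0, 35.0, 65.0, 45.0, 40.0],
})


def majority_rows(df: pd.DataFrame, race: str) -> pd.DataFrame:
    """Rows where `race` makes up more than 50% of the county population."""
    return df.loc[(df["Race"] == race) & (df["percent_race"] > 50)]


white_majority = majority_rows(toy, "White")["County,State"].tolist()
black_majority = majority_rows(toy, "Black")["County,State"].tolist()

print(white_majority)  # ['A County, XX']
print(black_majority)  # ['B County, XX']
# C County has no group above 50%, so it appears in neither list
```

Under this definition a county can have at most one majority race, and counties where no group exceeds 50% drop out of the comparison entirely.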

In [25]:
# Returns the rows for counties where the given race (passed by name, e.g. "Black") is the majority, defined as > 50%


def majority_race(race):
    
    r = df_percents_by_county["percent_race"]
    
    df_majority = df_percents_by_county.loc[r > 50, :]
    
    df_majority = df_majority.loc[df_majority["Race"] == race, :]
    
    return df_majority


#create data frames based on majority race

df_black_majority = majority_race("Black")
df_white_majority = majority_race("White")
#df_hisp_majority = majority_race("Hispanic")
#df_asian_majority = majority_race("Asian")
#df_native_majority = majority_race("American Indian")

3. Narrowing the focus even further to look at counties that are majority white vs majority black, how is the percent of the county population that died due to COVID-19 distributed across these counties?

In [26]:
# combine majority white and majority black data frames

df_race_majority = pd.concat([df_white_majority, df_black_majority])


# plot

px.defaults.width = 800
px.defaults.height = 600

hist2 = px.histogram(df_race_majority, 
                    x="percent_death",
                   nbins = 20,
                   log_y = True,
                   title = "Distribution of deaths by racial makeup of county", 
                    color = "Race",
                     color_discrete_sequence=px.colors.sequential.Rainbow,
                     opacity = 0.8,
                     histnorm = "percent",
                     #facet_row = "Race",
                     hover_data = ["County,State", "percent_death", "percent_race"],
                     labels={'percent_death':'percent death', "percent_race":"percent race"},
                     marginal = "violin"
                   )

hist2.update_layout(xaxis_title_text = 'Percent county pop that died due to COVID-19',
                   yaxis_title_text = "percentage of counties")

plotly.offline.iplot(hist2)

The plot above shows the distributions of the percentage of the county population that died due to COVID-19 for both black majority and white majority counties. There are 2,807 counties in the US that are majority white but only 102 that are majority black, so the y-axis has been normalized to the percentage of counties within each group. Majority black counties have death rates skewed further right than majority white counties (i.e. a larger share of them have a higher percentage of COVID-related deaths).
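
To see why the normalization matters, here is a small sketch with invented bin counts. Only the group sizes (2,807 and 102) come from the text; how the counties fall into the two bins is made up purely to show the arithmetic.

```python
# Hypothetical split of counties into two death-rate bins.
# Group totals (2807 and 102) are from the text; the per-bin counts are invented.
bin_counts = {
    "Majority white": {"low": 2700, "high": 107},
    "Majority black": {"low": 82, "high": 20},
}


def to_percent(counts: dict) -> dict:
    """Convert raw bin counts to percent of the group's total."""
    total = sum(counts.values())
    return {name: round(n / total * 100, 1) for name, n in counts.items()}


for group, counts in bin_counts.items():
    print(group, to_percent(counts))
```

In this made-up example the raw "high" count is larger for the majority white group simply because there are far more such counties, while the within-group percentage is larger for the majority black group. Comparing percentages avoids that distortion.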

4. What percentage of majority black vs majority white counties have had over 0.1% of their population die due to COVID-19?

In [27]:
#percent of black counties with a death rate over .1%

df_b = df_black_majority.loc[df_black_majority["percent_death"] > .1, :]
percent_b = round(((df_b.shape[0]/df_black_majority.shape[0])*100), 4)

#percent of white counties with a death rate over .1%

df_w = df_white_majority.loc[df_white_majority["percent_death"] > .1, :]
percent_w = round(((df_w.shape[0]/df_white_majority.shape[0])*100), 4)



df_bad_counties = pd.DataFrame({"Race": ["Majority white", "Majority black"],
                                "%counties death rate > 0.1%":[percent_w, percent_b]})

#plot and compare


px.defaults.width = 600
px.defaults.height = 600

bar2 = px.bar(df_bad_counties, 
              x = "Race",
              y = "%counties death rate > 0.1%",
              color = "Race",
              color_discrete_sequence=px.colors.sequential.Rainbow,
             title = "Percent of counties with a death rate > 0.1%",
             hover_data = ["%counties death rate > 0.1%"])

bar2.update_yaxes(tickprefix = "%")

plotly.offline.iplot(bar2)

As of 4/29, almost 5% of counties in the US with a majority black population have had more than 0.1% of their population die due to COVID-19. The same can be said for only 0.25% of majority white counties.
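
The calculation behind this kind of comparison is just "what share of a group exceeds a threshold". A minimal sketch, using a hypothetical helper `share_above` and invented death-rate values:

```python
def share_above(rates, threshold):
    """Percent of values in `rates` that are strictly greater than `threshold`."""
    if not rates:
        raise ValueError("rates must not be empty")
    hits = sum(1 for r in rates if r > threshold)
    return round(hits / len(rates) * 100, 4)


# Invented percent-death values for five hypothetical counties
toy_rates = [0.0, 0.02, 0.15, 0.08, 0.3]
print(share_above(toy_rates, 0.1))  # 40.0
```

The same helper with a threshold of 0 applied to death counts would give the share of counties with at least one death.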

5. Broadening this scope, what percentage of majority black vs majority white counties have had at least one death due to COVID-19?

In [28]:
#majority white counties with at least 1 death due to COVID

df_white_majority_death = df_white_majority.loc[df_white_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)


#majority black counties with at least 1 death due to COVID

df_black_majority_death = df_black_majority.loc[df_black_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)


#plot and compare

percent_death_white = (df_white_majority_death.shape[0]/df_white_majority.shape[0])*100
percent_death_black = (df_black_majority_death.shape[0]/df_black_majority.shape[0])*100

df_county_rates = pd.DataFrame({"Race": ["Majority white", "Majority black"],
                                "Percent of counties with death":[percent_death_white, percent_death_black]})

bar3 = px.bar(df_county_rates, 
              x = "Race",
              y = "Percent of counties with death",
              color = "Race",
              color_discrete_sequence=px.colors.sequential.Rainbow,
             title = "Percent of counties with at least one COVID-related death",
             hover_data = ["Percent of counties with death"])

bar3.update_yaxes(tickprefix = "%")

plotly.offline.iplot(bar3)

As of 4/29, 77.45% of counties in the US with a majority black population have had at least one death due to COVID-19. The same can be said for only 44.3% of majority white counties. In other words, majority black counties are about 1.75 times as likely as majority white counties to have recorded at least one death due to COVID-19.
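
The comparison in the last sentence is simply the ratio of the two percentages quoted above. A one-line check, with variable names of my own choosing:

```python
# Percentages quoted in the text above (as of 4/29)
pct_black_majority = 77.45
pct_white_majority = 44.3

ratio = round(pct_black_majority / pct_white_majority, 2)
print(ratio)  # 1.75
```

This is a ratio of county shares, so it describes how often counties in each group have recorded a death, not the risk faced by any individual resident.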

Conclusions

Though over 2000 counties in the US have a population made up mostly of people who identify as non-Hispanic white, counties with majority non-white populations, specifically Black or African-American, are being disproportionately affected by COVID-19. Overall death rates in these counties are higher, and they top the lists of counties with the highest COVID death rates. These data support what news sources are reporting regarding the issue. Of note, death rates are likely also impacted by socioeconomic status, healthcare access, and crowdedness (people per square mile), though these issues are also linked to racial disparities.